Let p and q be probability densities. In the terminology of S. Kullback, the
directed divergences of p and q are given by ∫ p(x) log (p(x)/q(x)) dx and by
the same expression with p and q interchanged; the divergence is the sum of
the directed divergences. These quantities have applications in information
theory and to the problem of assigning prior probabilities subject to
constraints. In this report, it is shown that the directed divergences and
their positive linear combinations, including the divergence, are
characterized by axioms of positivity, additivity, and finiteness; in the
course of the proof, the latter two are shown to imply yet another axiom:
invariance. These axioms are fundamental in work on prior probabilities. It
has been claimed that they characterize only constant multiples of the single
directed divergence; that claim is here refuted.
AXIOMATIC CHARACTERIZATION
OF A FAMILY OF INFORMATION MEASURES
THAT CONTAINS THE DIRECTED DIVERGENCES


I. INTRODUCTION

R. L. Kashyap, in [1], has considered a system characterized by a vector Y
of random outputs and a vector Λ of certain other parameters such that the
conditional probability density p(Y|Λ) is some known function. The output Y
is of direct concern and directly accessible to measurement; Λ is not, except
insofar as it affects Y. It is assumed that Λ is distributed with a
probability density f(Λ), which Kashyap calls the "true" density. Together
with the known density p(Y|Λ), this determines a probability density p(Y) for
Y. Neither f(Λ) nor p(Y) is given, although f may be subjected to certain
known convex constraints. Without the knowledge that defines the density
f(Λ), one has the problem of assigning a "prior" density b(Λ) that reflects
the knowledge one does have: the conditional density p(Y|Λ) and the
constraints. Additional information, such as particular measurements of Y,
may be folded in by Bayes's theorem. But it is the important problem of
choosing the prior that concerns Kashyap.

Among previously suggested guides to choosing a prior is the principle of
maximum entropy [2]: the "most noncommittal" prior is the one that maximizes
entropy subject to the given constraints. Kashyap proposes an alternative,
the "principle of min-max uncertainty." He defines an "uncertainty
functional" φ(Λ;Y), an information-theoretic measure of the discrepancy
between the probability density for Y that we assign, defined by b(Λ) and
p(Y|Λ), and the "true" density, defined by f(Λ) and p(Y|Λ). The proposed
principle is to choose b so as to place the lowest possible upper bound on the
largest value φ(Λ;Y) can take for any f satisfying the constraints; that
is, choose b to minimize the maximum possible uncertainty. The functional
is defined so as to satisfy certain plausible axioms of additivity, coordinate
invariance, and the like, imposed on φ and an auxiliary functional ψ; Kashyap
claims that the axioms in fact determine φ uniquely, up to a pair of
arbitrary constants. That claim is the substance of Theorems 1 and 2 of his
paper.

However, the claim is erroneous; the theorems are false. In [3] we point
out the errors in the two proofs and exhibit counterexamples: functionals that
satisfy the axioms in the paper but form a family larger than the family of
functionals presented there. We also correct Kashyap's Lemma 1, which is
false as stated but can be repaired well enough to serve its purpose. Our
concern here is to correct Kashyap's Theorem 1 by finding the complete set of
functionals determined by the axioms imposed on ψ. We also show that one of
the axioms is redundant, being a consequence of the others.


Note: Manuscript submitted October 27, 1977.

Theorem 1 deals with the special case when the "true" distribution of Λ is
concentrated at a single point λ* (so f(λ) = δ(λ - λ*)) and the uncertainty
functional φ(Λ;Y) reduces to the form ψ(λ*;q(Y)), where

ψ(λ*;q(Y)) = ∫ d|y| L[p(y|λ*), q(y)]

for some function L. The axioms imposed on ψ are:

Positivity: ψ(λ*;q(Y)) ≥ 0 with equality only if q(Y) = p(Y|λ*).

Invariance: The uncertainty is invariant under nonsingular linear
transformations on Λ and Y.

Additivity: The uncertainty functional for a composite of two independent
systems is the sum of the uncertainty functionals for the two component
systems.

(For less abbreviated statements, see [1].)

Theorem 1 asserts that such a ψ can only be of the form

ψ(λ*;q(Y)) = C3 ∫ d|y| p(y|λ*) log (p(y|λ*)/q(y))

for some constant C3 > 0. But actually the two-parameter family

ψ(λ*;q(Y)) = C2 ∫ dy q(y) log (q(y)/p(y|λ*))

           + C3 ∫ dy p(y|λ*) log (p(y|λ*)/q(y))                         (1)

satisfies the axioms, including positivity if C2 and C3 are nonnegative and
not both zero. Kashyap obtains only the second term. When C2 = C3 = 1,
Kullback [4, pp. 6,7] calls the two terms "directed divergences" and their sum
the "divergence."
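As a rough illustration, the two directed divergences and their weighted combination can be evaluated for a discrete analogue of the densities. The sketch below is an illustrative assumption rather than part of Kashyap's or Kullback's development: the function names, the three-point distributions, and the replacement of the integral by a finite sum are all hypothetical choices.

```python
import math

def directed_divergence(p, q):
    """Sum of p_i * log(p_i / q_i) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def two_parameter_family(p, q, c2, c3):
    """c2 * D(q, p) + c3 * D(p, q), a discrete analogue of the two-parameter family."""
    return c2 * directed_divergence(q, p) + c3 * directed_divergence(p, q)

# Hypothetical example distributions (any strictly positive choice would do).
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

d_pq = directed_divergence(p, q)
d_qp = directed_divergence(q, p)
divergence = two_parameter_family(p, q, 1.0, 1.0)   # C2 = C3 = 1
```

With C2 = C3 = 1 the combination is the sum of the two directed divergences, which in general differ from each other.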


Kashyap's Theorem 2 deals with the general case, when the "true"
distribution of Λ need not be concentrated at a single point. The uncertainty
functional is assumed to have the form

φ(Λ;Y) = ∫ d|y| d|λ| L[p(y|λ), p(y), q(y), f(λ), b(λ)]

for some function L (different from the previous L). The axioms imposed are:

Positivity: φ(Λ;Y) ≥ ∫ d|λ| f(λ) ψ[λ;q(Y)] ≥ 0.

Consistency with ψ: If f(λ) = δ(λ - λ*), then φ(Λ;Y) = ψ(λ*;q(Y)).

Invariance: (as for ψ above).

Additivity: (as for ψ above).

Theorem 2 asserts that such a φ can only be of the form

φ(Λ;Y) = C1 C[f(Λ)] + C3 ∫ d|λ| f(λ) ψ(λ;q)

       = C1 C[f(Λ)] + C3 ∫ d|λ| d|y| f(λ) p(y|λ) log (p(y|λ)/q(y)),

where C is the mutual information:

C[f(Λ)] = ∫ d|λ| d|y| f(λ) p(y|λ) log (p(y|λ)/p(y)),

and C3 is as in the previous discussion of ψ.

However, the axioms are satisfied by at least the following 5-parameter family
of functionals:

φ(Λ;Y) = C1 ∫ dλ dy f(λ) p(y|λ) log (p(y|λ)/p(y))

       + C2 ∫ dλ dy f(λ) q(y) log (q(y)/p(y|λ))

       + C3 ∫ dλ dy f(λ) p(y|λ) log (p(y|λ)/q(y))

       + C4 ∫ dλ dy f(λ) p(y) log [...]

       + C5 ∫ dλ dy f(λ) q(y) log [...]

with all [...]. Kashyap obtains only the first and third terms.

Note: We retain the explicit constant C3 where Kashyap elects to set C3 = 1 by
convention.

Our counterexamples to Theorems 1 and 2 leave open the question whether
there are still further functionals satisfying the axioms. We will argue in
Section III that the two-parameter family (1) is essentially the complete
solution for ψ. We make no claims of completeness for the five-parameter
family of functionals φ. The theorem in Section III actually states that ψ
is characterized by the positivity and additivity axioms alone. The invariance
axiom is not needed as a hypothesis since it is a consequence of positivity
and additivity, as we show in Section II.

The characterization of ψ is similar to the characterization of the
directed divergence given by Pl. Kannappan [5,6]. Kannappan's result is
stated for finite, discrete distributions p = (p1, ..., pn) and
q = (q1, ..., qn), where q, but not p, is allowed to be incomplete: that is,
Σ qi ≤ 1; equality is not demanded. For a discussion and other references,
see the book by Aczél and Daróczy [7].


II.  DERIVATION  OF  INVARIANCE  FROM  ADDITIVITY 


At this point we abandon Kashyap's notation. The functional F and
function f in what follows correspond to his ψ and L.

In this section we prove a theorem to the effect that any functional
satisfying the additivity requirements imposed on ψ (= F) must also satisfy
the invariance requirement, even if we restrict attention to probability
densities p and q that nowhere take the value zero. The restriction sidesteps
problems with division by zero which would otherwise arise, since (1) involves
p/q and q/p. With general densities p and q, we could adopt some convention
like setting 0 log 0/0 = 0, but we defer consideration of such matters to
Section IV. We will also see in Section IV that, even with the restriction to
positive probability densities p and q, it is impossible to avoid problems
with divergent integrals that may lead to infinite values of F. For the
present we will therefore adopt, where appropriate, the additional hypothesis
F(p,q) < ∞, meaning that the integral defining F(p,q) converges absolutely to
a finite value. We also add a mild finiteness requirement, F(p,p) < ∞, to the
axioms. It then becomes possible to prove that in fact F(p,p) = 0. We can
even prove that f(t,t) = 0 for all t > 0; this is one of the lemmas used in
the proof of the theorem.

Theorem A.

Let f be a function of two real variables, and define a functional F by
setting

F(p,q) = ∫ f(p(x), q(x)) dx

whenever p and q are positive probability-density functions on a linear
space. (We write "function" for "real, measurable function"; "linear space"
for "finite-dimensional, real linear space"; and "set" for "measurable set.")
Suppose F satisfies the following two axioms.

Finiteness. F(p,p) < ∞.

Additivity. If the space on which p and q are defined is the product of
two linear spaces X' and X", and p and q have the product form

p(x', x") = p'(x')p"(x")

q(x', x") = q'(x')q"(x")

in terms of probability-density functions p', q' on X' and p", q" on X",
then

F(p,q) = F(p',q') + F(p",q").

Then F also satisfies the following axiom, at least if F(p,q) < ∞ or
F(p',q') < ∞.

Invariance. If T is a nonsingular linear transformation on the domain of
p and q and we define probability densities p' and q' by

p'(x) = p(Tx) det T,

q'(x) = q(Tx) det T,

then

F(p',q') = F(p,q).
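The additivity axiom stated in Theorem A can be checked numerically in a discrete analogue, where product densities become outer products of probability vectors. The sketch below is only an illustration under that assumption; it uses the directed divergence as the functional F, and the helper names and example vectors are hypothetical.

```python
import math
from itertools import product

def directed_divergence(p, q):
    """Sum of p_i * log(p_i / q_i) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def product_distribution(a, b):
    """Joint distribution of two independent components, flattened to a list."""
    return [ai * bj for ai, bj in product(a, b)]

# Hypothetical component distributions on a 2-point and a 3-point space.
p1, q1 = [0.6, 0.4], [0.3, 0.7]
p2, q2 = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]

joint = directed_divergence(product_distribution(p1, p2),
                            product_distribution(q1, q2))
parts = directed_divergence(p1, q1) + directed_divergence(p2, q2)
```

Here `joint` and `parts` should agree up to rounding, which is the discrete counterpart of F(p,q) = F(p',q') + F(p",q").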


Lemma A.1.

Assume the hypotheses of Theorem A. Let p' and q' be positive probability
densities with F(p',q') < ∞. Then

∫ f(p'(x), q'(x)) dx - (1/s) ∫ f(sp'(x), sq'(x)) dx = f(1,1) - f(s,s)/s

for every real number s > 0.

Lemma A.2.

Assume the hypotheses of Theorem A. Then

f(t,t) = 0

for every real number t > 0.


Proof of Lemmas.

To get information about f(t,t) for some particular real numbers t, we
will apply the additivity axiom to density functions that assume constant
values on certain intervals. Consider any two numbers u > 0 and s > 0.
Define p1"(x") = q1"(x") = u when 0 ≤ x" ≤ a, where a is chosen so that
au < 1. Off the interval A = [0,a], define p1" in any way that makes p1" a
positive probability-density function on the real numbers; let q1" = p1".
Define

                     p1"(x"),              x" < 0
p2"(x") = q2"(x") =  su,                   0 ≤ x" ≤ a/s
                     p1"(a + x" - a/s),    a/s < x"

(see Figure 1). Then p2" = q2" is a positive probability density. By the
additivity axiom,

∫_{A^c} dx" ∫ dx' f(p'(x')p1"(x"), q'(x')p1"(x"))

    + a ∫ dx' f(up'(x'), uq'(x'))

  = ∫ dx' f(p'(x'), q'(x'))

    + ∫_{A^c} dx" f(p1"(x"), p1"(x")) + a f(u,u)

and

∫_{(A/s)^c} dx" ∫ dx' f(p'(x')p2"(x"), q'(x')p2"(x"))

    + (a/s) ∫ dx' f(sup'(x'), suq'(x'))

  = ∫ dx' f(p'(x'), q'(x'))

    + ∫_{(A/s)^c} dx" f(p2"(x"), p2"(x")) + (a/s) f(su,su),

where A^c and (A/s)^c are the complements of the intervals [0, a] and
[0, a/s]. The first term on the left-hand side and the first two on the
right-hand side of the first equation cancel with the corresponding three
terms of the second. Subtracting, we find

a ∫ f(up'(x'), uq'(x')) dx' - (a/s) ∫ f(sup'(x'), suq'(x')) dx'

  = a f(u,u) - (a/s) f(su,su).                                          (2)

Writing x for the variable of integration and setting u = 1 gives

∫ f(p'(x), q'(x)) dx - (1/s) ∫ f(sp'(x), sq'(x)) dx

  = f(1,1) - f(s,s)/s.                                                  (3)

This establishes Lemma A.1.

Now take any probability densities p, q, p', q' related by

p(x) = up'(ux),

q(x) = uq'(ux),

and such that F(p,q) < ∞, F(p',q') < ∞. By Lemma A.1, equation (3) still holds
when we replace p' and q' by p and q, since p and q satisfy the hypotheses as
well as p' and q'. Therefore, we may substitute up'(ux) and uq'(ux) for p'(x)
and q'(x) in (3). We get

∫ f(up'(ux), uq'(ux)) dx - (1/s) ∫ f(sup'(ux), suq'(ux)) dx

  = f(1,1) - f(s,s)/s.

But a change of variables x' = ux in (2), after division by a, gives

u ∫ f(up'(ux), uq'(ux)) dx - (u/s) ∫ f(sup'(ux), suq'(ux)) dx

  = f(u,u) - f(su,su)/s.

By the last two equations,

f(1,1) - f(s,s)/s = f(u,u)/u - f(su,su)/su.

Define the function h by

h(t) = f(t,t)/t - f(1,1);

then

h(su) = h(s) + h(u).

The  solutions  of  this  equation,  aside  from  non-measurable  functions,  are  all 
of  the  form 

h(t)  = a log  t 

for  some  constant  a;  thus 

f(t,t)  = at  log  t + bt, 

where  b = f(l,l).  Thus, 


F(p,p) = a ∫ p(x) log p(x) dx + b ∫ p(x) dx.

If a ≠ 0, we can violate the finiteness axiom by taking a positive
probability density p for which the first integral diverges; therefore, a = 0,
and F(p,p) = b. Now the additivity axiom implies that b = b + b; therefore,
b = 0. Thus,

f(t,t) = 0.

This establishes Lemma A.2.

Proof  of  Theorem. 

For  any  positive  probability  densities  p and  q and  any  nonsingular  linear 
transformation  T (all  on  the  same  linear  space),  define 

p'(x) = p(Tx) det T,

q'(x) = q(Tx) det T,

and set s = 1/det T. We may assume F(p',q') < ∞; if not, exchange the roles of
the primed and the unprimed quantities and replace T by its inverse. By the
two lemmas,

∫ f(p'(x), q'(x)) dx = (1/s) ∫ f(sp'(x), sq'(x)) dx.

But then

∫ f(p'(x), q'(x)) dx = ∫ f(p(Tx), q(Tx)) (det T) dx.

The left-hand side is F(p',q'). A change of variables makes the right-hand
side F(p,q). Therefore the invariance axiom holds.
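The invariance just proved can be illustrated numerically in one dimension. The sketch below assumes, purely for illustration, that F is the directed divergence, that p and q are two Gaussian densities, and that T is multiplication by t = 3; the helper names and the midpoint-rule integration are hypothetical choices, not part of the proof.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log of the Gaussian density with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def directed_divergence(log_p, log_q, lo=-12.0, hi=12.0, n=48000):
    """Midpoint-rule estimate of the integral of p log(p/q) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        lp, lq = log_p(x), log_q(x)
        total += math.exp(lp) * (lp - lq)
    return total * h

def transformed(log_density, t):
    """Log of p'(x) = p(t x) * t for the one-dimensional map T x = t x, t > 0."""
    return lambda x: log_density(t * x) + math.log(t)

def log_p(x):
    return log_normal_pdf(x, 0.0, 1.0)

def log_q(x):
    return log_normal_pdf(x, 1.0, 2.0)

f_pq = directed_divergence(log_p, log_q)
f_pq_scaled = directed_divergence(transformed(log_p, 3.0), transformed(log_q, 3.0))
# Known closed form for these two Gaussians: log(2) + (1 + 1)/8 - 1/2.
closed_form = math.log(2.0) + 2.0 / 8.0 - 0.5
```

Working with log-densities avoids taking the logarithm of an underflowed value in the tails; the two estimates should agree with each other and with the closed form up to quadrature error.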


III.  THE  MAIN  THEOREM 


In this section we prove a theorem that characterizes all the functionals
F that satisfy the positivity and additivity axioms (together with finiteness
of F(p,p)). As a corollary, we obtain a similar theorem in which the
positivity condition F(p,q) ≥ 0 is replaced by a semiboundedness assumption
F(p,q) ≥ F(p,p).

Theorem  B. 

Let  f be  a function  of  two  real  variables  and  define  a functional  F by 
setting 

F(p,q) = ∫ f(p(x), q(x)) dx

whenever p and q are positive probability-density functions on a linear
space. Suppose F satisfies the finiteness and additivity axioms of Theorem A
together with the following axiom.

Positivity. F(p,q) ≥ 0 with equality only if p = q.

Then F has the form

F(p,q) = B ∫ q(x) log (q(x)/p(x)) dx + C ∫ p(x) log (p(x)/q(x)) dx

for some constants B, C ≥ 0, not both zero.

Outline of Proof.

Before embarking on the proof, we state and prove a lemma that will be
used several times. The proof proper has been divided into four steps. The
first uses the lemma and the invariance axiom, available by virtue of Theorem
A, to show that f can be written as

f(u,v) = g(u/v)u + D(u - v) log v

in terms of a constant D and a function g of one variable. The second step
uses the additivity axiom and invokes the lemma twice; the conclusion is that
a certain expression involving g depends on only two of the three variables it
contains. Namely,

[g(tu) - g(t) - g(u)] tu/[(1-t)(1-u)]

  - [g(tv) - g(t) - g(v)] tv/[(1-t)(1-v)]

is independent of t. The third step proceeds to a general form for f:

f(u,v) = A(u - v) + B v log (v/u)

         + C u log (u/v) + D(u - v) log v.

The last step, with the help of the positivity axiom, eliminates A and D and
shows that F(p,q) has the form stated in the conclusion of the theorem.

Lemma B.

Let h be a function of one real variable. Suppose the equation

∫ h(p(x)/q(x)) p(x) dx = 0

holds for all positive probability-density functions p and q, on some linear
space, that satisfy F(p,q) < ∞. Then h has the form

h(t) = K(1/t - 1)

for some real number K and all t > 0.

Proof of Lemma.

Let q be a positive probability density on the given linear space. Let u
and v be any two positive numbers with u > 1 ≥ v > 0 and set
r = (1 - v)/(u - v). Then 0 ≤ r < 1 and ru + (1 - r)v = 1. Choose a bounded
set M, a subset A of M, and a real number m with

∫_M q(x) dx = m > 0

and

∫_A q(x) dx = rm;

define

        u q(x)    (x ∈ A)
p(x) =  v q(x)    (x ∈ M-A)
        q(x)      (x ∈ M^c),

where M-A is the relative complement of A in M and M^c is the complement of M.
Then p is a positive probability density, since

∫ p(x) dx = ∫_A p(x) dx + ∫_{M-A} p(x) dx + ∫_{M^c} p(x) dx

          = u ∫_A q(x) dx + v ∫_{M-A} q(x) dx + ∫_{M^c} q(x) dx

          = rmu + (1 - r)mv + (1 - m)

          = 1.

Moreover, F(p,q) < ∞, since

F(p,q) = ∫_M f(p(x), q(x)) dx + ∫_{M^c} f(q(x), q(x)) dx,

where the first integral is over a bounded set and the second vanishes by
Lemma A.2. Thus, p and q are positive probability densities on the given
linear space and satisfy F(p,q) < ∞. By hypothesis, this implies

∫ h(p(x)/q(x)) p(x) dx = 0.

Now, with M^c denoting the complement of M,

∫ h(p(x)/q(x)) p(x) dx

  = ∫_A h(u) u q(x) dx + ∫_{M-A} h(v) v q(x) dx + ∫_{M^c} h(1) q(x) dx

  = rmuh(u) + (1 - r)mvh(v) + (1 - m)h(1).

Therefore

[(1 - v)/(u - v)] muh(u) + [(u - 1)/(u - v)] mvh(v) + (1 - m)h(1) = 0.

When v = 1, this implies

h(1) = 0.

When v ≠ 1, it implies

h(u) u/(1 - u) = h(v) v/(1 - v),

and then both sides equal some constant K, since one side is independent of u
and the other is independent of v. Consequently,

h(u) = K(1/u - 1),

h(v) = K(1/v - 1)

whenever u > 1 and 0 < v ≤ 1. Therefore

h(t) = K(1/t - 1)

for every t > 0.
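The piecewise construction used in the proof of Lemma B can be mimicked with a discrete distribution. The sketch below is an illustrative analogue only: the ten-point uniform q, the choice of M and A as index sets, and the constant K are assumptions made for the example. It checks that the constructed p sums to 1, that h(t) = K(1/t - 1) makes the weighted sum vanish, and that a different h (here the logarithm) does not.

```python
import math

def build_p(q, in_a, in_m, u, v):
    """p = u*q on A, v*q on M-A, q elsewhere (discrete analogue of the proof)."""
    p = []
    for qi, a_flag, m_flag in zip(q, in_a, in_m):
        if a_flag:
            p.append(u * qi)
        elif m_flag:
            p.append(v * qi)
        else:
            p.append(qi)
    return p

def weighted_sum(h, p, q):
    """Discrete analogue of the integral of h(p/q) p."""
    return sum(h(pi / qi) * pi for pi, qi in zip(p, q))

q = [0.1] * 10
in_m = [i < 6 for i in range(10)]   # M carries mass m = 0.6
u, v = 1.5, 0.5                     # u > 1 >= v > 0
r = (1.0 - v) / (u - v)             # r = 0.5, so A must carry mass r*m = 0.3
in_a = [i < 3 for i in range(10)]   # A is a subset of M with mass 0.3
p = build_p(q, in_a, in_m, u, v)

K = 2.5                             # arbitrary constant
lemma_form = weighted_sum(lambda t: K * (1.0 / t - 1.0), p, q)
other_form = weighted_sum(math.log, p, q)
```

The value `other_form` is the directed divergence of p from q and is strictly positive here, so the logarithm is not of the form required by the lemma.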
Proof of Theorem.

Step 1. Theorem A implies that the invariance axiom holds for p and q
such that F(p,q) < ∞. To get information about f(u,v) for some particular
real numbers u and v, we will apply the axiom to density functions that assume
constant values on certain intervals.

Let t > 0. Multiplication by t is a nonsingular linear transformation T
in one dimension: Tx = tx. The axiom in this case implies

∫ f(p(x), q(x)) dx = ∫ f(tp(tx), tq(tx)) dx

                   = ∫ f(tp(x), tq(x)) (1/t) dx,

where the second equality results from a change of variables: (1/t)x for x.

Let u > 0 and s > 0. Define p(x) = u and q(x) = 1 when 0 ≤ x ≤ a, where a
is chosen so that au < 1 and a < 1. Off the interval [0,a] define p and q in
any way that makes them positive probability-density functions on the real
numbers and makes F(p,q) finite. Define

         p(x),              x < 0
p'(x) =  su,                0 ≤ x ≤ a/s
         p(a + x - a/s),    a/s < x

         q(x),              x < 0
q'(x) =  s,                 0 ≤ x ≤ a/s
         q(a + x - a/s),    a/s < x

(compare Figure 1 and the proof of Theorem A).

Then ∫ p'(x) dx = ∫ q'(x) dx = 1, since the integrals of p' and q' over the
subintervals (-∞, 0), [0, a/s], and (a/s, ∞) are respectively equal to the
integrals of p and q over (-∞, 0), [0, a], and (a, ∞). Thus p' and q' are
positive probability-density functions. Moreover, F(p',q') < ∞. We have

∫ f(p(x), q(x)) dx - ∫ f(p'(x), q'(x)) dx

  = af(u,1) - af(su,s)/s,

since the parts of the first integral outside [0,a] cancel the parts of the
second outside [0,a/s]. Similarly,
∫ f(tp(x), tq(x)) (1/t) dx - ∫ f(tp'(x), tq'(x)) (1/t) dx

  = af(tu,t)/t - af(stu,st)/st.

Therefore, by the invariance axiom,

f(u,1) - f(su,s)/s = f(tu,t)/t - f(stu,st)/st.

For any given fixed positive value of u, define

k(s) = f(u,1)/u - f(su,s)/su;

it follows that

k(s) + k(t) = k(st)

for arbitrary s, t > 0, and k is therefore of the form

k(s) = h log s,

where h is a real number. We write h(u) for h, since h may depend on the
given value of u. Then, for any positive numbers u and s, we have

f(u,1)/u - f(su,s)/su = h(u) log s.

Define

g(u) = f(u,1)/u;

then, substituting u/v for u and v for s, we find that f has the general form

f(u,v) = g(u/v)u - h(u/v) u log v

for arbitrary u, v > 0.

Now the invariance axiom implies

∫ [g(p(x)/q(x)) - h(p(x)/q(x)) log q(x)] p(x) dx

  = ∫ [g(tp(x)/tq(x)) - h(tp(x)/tq(x)) log tq(x)] tp(x)(1/t) dx,

which simplifies to

(log t) ∫ h(p(x)/q(x)) p(x) dx = 0.



For any given fixed p', q' as above, define a new h by setting

h(t) = ∫ [g(tp'(x')/q'(x')) - g(p'(x')/q'(x')) - g(t)] p'(x') dx'.

Then

∫ h(p"(x")/q"(x")) p"(x") dx" = 0.

By the lemma, h(t) = K(1/t - 1) for some real number K (which may depend on
the given p' and q'). Whatever the value of K, we have

h(1) = 0,

uh(u)/(1 - u) = vh(v)/(1 - v)

when u ≠ 1 and v ≠ 1. The first of these equations implies

g(1) = 0.

The second, when expanded, becomes

∫ { [g(up'(x')/q'(x')) - g(p'(x')/q'(x')) - g(u)] u/(1-u)

  - [g(vp'(x')/q'(x')) - g(p'(x')/q'(x')) - g(v)] v/(1-v) } p'(x') dx' = 0.

This holds whenever p' and q' are positive probability-density functions with
F(p',q') < ∞.

Now, for any given fixed values of u and v (positive, not 1), define a new
h:

h(t) = [g(tu) - g(t) - g(u)] u/(1-u)

     - [g(tv) - g(t) - g(v)] v/(1-v).

Then

∫ h(p'(x')/q'(x')) p'(x') dx' = 0.

By the lemma, h(t) = K(1/t - 1) for some real number K. We write K(u,v) for
K, since K may depend on the given values of u and v. Then

[g(tu) - g(t) - g(u)] tu/[(1-t)(1-u)]

  - [g(tv) - g(t) - g(v)] tv/[(1-t)(1-v)]

  = K(u,v)

for any positive t, u, and v different from 1.

Step 3. Define

k(u,v) = [g(uv) - g(u) - g(v)] uv/[(1-u)(1-v)].

The result of step 2 is

k(t,u) - k(t,v) = K(u,v).

Therefore, for t = v and t = u, respectively,

k(v,u) - k(v,v) = K(u,v) = k(u,u) - k(u,v).

But k(u,v) = k(v,u); hence

2k(u,v) = k(u,u) + k(v,v).

Define a function

a(u) = k(u,u)(1 - u)/2.

Then

k(u,v) = a(u)/(1 - u) + a(v)/(1 - v).

By the definition of k(u,v),

uv [g(uv) - g(u) - g(v)] = a(u)(1 - v) + a(v)(1 - u).                  (4)

(This holds even when u = 1 or v = 1, since g(1) = 0.) Substituting t for u
and uv for v leads to

tuv [g(tuv) - g(t) - g(uv)] = a(t)(1 - uv) + a(uv)(1 - t);

this, with the previous equation, gives

tuv [g(tuv) - g(t) - g(u) - g(v)]

  = a(t)(1 - uv) + a(uv)(1 - t) + t a(u)(1 - v) + t a(v)(1 - u).
Consequently,

tuv [g(tuv) - g(t) - g(u) - g(v)]

  - a(t)(1 - uv) - a(u)(1 - tv) - a(v)(1 - tu)

  = a(uv)(1 - t) - a(u)(1 - t) - a(v)(1 - t)

  = a(tu)(1 - v) - a(t)(1 - v) - a(u)(1 - v),

where the second equality is due to the symmetry of the left-hand side in t,
u, and v. It follows that

[a(tu) - a(t) - a(u)]/[(1 - t)(1 - u)]

  = [a(vu) - a(u) - a(v)]/[(1 - u)(1 - v)]

  = [a(tv) - a(t) - a(v)]/[(1 - t)(1 - v)],

where the second equality is again due to symmetry. The left-hand side is
thus independent of t and u; it is therefore a constant, which with foresight
we denote A/2:

[a(tu) - a(t) - a(u)]/[(1 - t)(1 - u)] = A/2.

Hence

a(tu) - a(t) - a(u) = (A/2) [-(1 - tu) + (1 - t) + (1 - u)].

This can be written as

b(t) + b(u) = b(tu),

where the function b is defined by

b(t) = -a(t) - (A/2)(1 - t).

It follows that

b(t) = B log t

for some constant B. Thus

a(t) = -(A/2)(1 - t) - B log t.                                         (5)

Define a function

c(t) = g(t) + A(1/t - 1) + B(log t)/t.
We have

uv [c(uv) - c(u) - c(v)]

  = uv [g(uv) - g(u) - g(v)]

    + Auv [(1/uv - 1) - (1/u - 1) - (1/v - 1)]

    + Buv [(log uv)/uv - (log u)/u - (log v)/v]

  = a(u)(1 - v) + a(v)(1 - u)

    + A(1 - u)(1 - v)

    + B(log u)(1 - v) + B(log v)(1 - u)

  = 0

with the help of (4) and (5); thus c(uv) - c(u) - c(v) = 0. Therefore

c(t) = C log t

for some constant C. This yields

g(t) = -A(1/t - 1) - B(log t)/t + C log t.

Finally, by the result of Step 1,

f(u,v) = A(u - v) + B v log (v/u)

         + C u log (u/v) + D(u - v) log v.
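The algebra of this step can be spot-checked numerically. The sketch below takes the stated forms of a(t) and g(t), picks arbitrary illustrative constants A, B, C (an assumption of the example, as are the sample points and function names), and confirms that equation (4) holds and that g(u/v)u reproduces the first three terms of f(u,v).

```python
import math

A, B, C = 0.7, -1.3, 2.1   # arbitrary illustrative constants

def a(t):
    """a(t) = -(A/2)(1 - t) - B log t, as in equation (5)."""
    return -(A / 2.0) * (1.0 - t) - B * math.log(t)

def g(t):
    """g(t) = -A(1/t - 1) - B (log t)/t + C log t."""
    return -A * (1.0 / t - 1.0) - B * math.log(t) / t + C * math.log(t)

def f_from_g(u, v):
    """g(u/v) u, the part of f that does not involve D."""
    return g(u / v) * u

def f_closed(u, v):
    """A(u - v) + B v log(v/u) + C u log(u/v)."""
    return A * (u - v) + B * v * math.log(v / u) + C * u * math.log(u / v)

def residual_eq4(u, v):
    """Left-hand side minus right-hand side of equation (4)."""
    lhs = u * v * (g(u * v) - g(u) - g(v))
    rhs = a(u) * (1.0 - v) + a(v) * (1.0 - u)
    return lhs - rhs

points = [(0.3, 2.0), (1.7, 0.4), (5.0, 3.0), (0.2, 0.9)]
max_residual = max(abs(residual_eq4(u, v)) for u, v in points)
max_form_gap = max(abs(f_from_g(u, v) - f_closed(u, v)) for u, v in points)
```

Both maxima should be at rounding level, which is consistent with the derivation but of course does not replace it.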

Step 4. By the result of step 3,

F(p,q) = A ∫ (p(x) - q(x)) dx + B ∫ q(x) log (q(x)/p(x)) dx

       + C ∫ p(x) log (p(x)/q(x)) dx + D ∫ (p(x) - q(x)) log q(x) dx.

The first term on the right is zero and may be deleted, since
∫ p(x) dx = 1 = ∫ q(x) dx. The coefficients B, C, and D of the remaining
terms are restricted by the positivity axiom. In fact, D must be zero.

To show this, we start by choosing functions q and r that satisfy three
requirements:

(1) q + tr is a positive probability density for all t in some
neighborhood of 0,

(2) ∫ r(x) log q(x) dx ≠ 0,

(3) differentiation under the integral sign is permissible in computing
dF(q + tr, q)/dt.
The first requirement implies that q is a positive probability density and
that ∫ r(x) dx = 0. The differentiation gives

dF(q + tr, q)/dt = -B ∫ r(x) q(x)/(q(x) + tr(x)) dx

                 + C ∫ r(x) [1 - log (q(x)/(q(x) + tr(x)))] dx

                 + D ∫ r(x) log q(x) dx.

When t = 0, this reduces to

(C - B) ∫ r(x) dx + D ∫ r(x) log q(x) dx.

The first of the three requirements implies that the first term vanishes. The
second requirement implies that the second term is nonzero unless D = 0.

Suppose D ≠ 0. Then, when t = 0, we have F(q + tr,q) = F(q,q) = 0, but
dF(q + tr,q)/dt ≠ 0. It follows that F(q + tr,q) assumes both positive and
negative values as t varies in some neighborhood of 0. This contradicts the
positivity axiom. Therefore D = 0. Thus
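The sign change that rules out D ≠ 0 can be seen in a small discrete example. The sketch below is an illustrative analogue under stated assumptions: sums replace integrals, and the three-point q, the direction r, the step size, and the coefficient values are hypothetical choices for the example.

```python
import math

def F(p, q, B, C, D):
    """Discrete analogue of B*sum(q log(q/p)) + C*sum(p log(p/q)) + D*sum((p - q) log q)."""
    term_b = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    term_c = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    term_d = sum((pi - qi) * math.log(qi) for pi, qi in zip(p, q))
    return B * term_b + C * term_c + D * term_d

q = [0.5, 0.3, 0.2]
r = [1.0, 0.0, -1.0]   # sums to zero, and sum(r * log q) = log(0.5/0.2) is nonzero

def perturbed(t):
    """q + t r, a positive probability vector for small |t|."""
    return [qi + t * ri for qi, ri in zip(q, r)]

with_d_neg = F(perturbed(-0.01), q, 1.0, 1.0, 1.0)
with_d_pos = F(perturbed(+0.01), q, 1.0, 1.0, 1.0)
without_d_neg = F(perturbed(-0.01), q, 1.0, 1.0, 0.0)
without_d_pos = F(perturbed(+0.01), q, 1.0, 1.0, 0.0)
```

With D = 1 the value is negative on one side of t = 0 and positive on the other, because the D term is linear in t while the B and C terms are of second order; with D = 0 both values are positive.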


F(p,q) = B ∫ q(x) log (q(x)/p(x)) dx

       + C ∫ p(x) log (p(x)/q(x)) dx.

Both integrals are nonnegative and vanish only if p = q. By suitable choice
of p and q, either integral may be made arbitrarily large in comparison with
the other. Therefore, the positivity axiom requires that B and C be both
nonnegative and at least one positive.

Corollary.

Theorem B remains true if, in the hypotheses, the positivity axiom is
replaced with the following axiom.

Semiboundedness. F(p,q) ≥ F(p,p) with equality only if p = q.

Proof.

Assume the modified hypotheses. By Lemma A.2, F(p,p) = 0; consequently,
the semiboundedness axiom implies the positivity axiom. The unmodified
hypotheses of Theorem B are thus satisfied. Therefore the conclusion holds.
IV. DISCUSSION

The theorem just proved shows that in computing the functional
F(p,q) = ∫ f(p(x), q(x)) dx, we may take

f(u,v) = B v log (v/u) + C u log (u/v)                                  (6)

when u > 0 and v > 0. The possible extension of F to densities p and q that
may assume the value 0 was deferred to this section. The expression for
f(u,v) prepares one to expect infinities to crop up when the arguments of f
become 0.

As a matter of fact, the functional F, though not the function f, may
become infinite even when we keep the restrictions p(x) > 0, q(x) > 0.
Kullback [4, pp. 6, 10] points out the problem and gives some examples. Here
is one more. Let

p(x) = exp(-x²/2)/√(2π),

q(x) = 1/[π(1 + x²)].

Then one directed divergence is infinite and the other is finite:

∫ q(x) log (q(x)/p(x)) dx = ∞,

∫ p(x) log (p(x)/q(x)) dx < ∞.

Setting

                    p(x),    x > 0
p'(x) = q'(-x) =
                    q(x),    x ≤ 0

gives densities p' and q' whose directed divergences are both infinite.
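The contrast between a finite and an infinite directed divergence can be made visible by truncating the integrals. The sketch below assumes, for illustration, that p is the standard normal density and q the standard Cauchy density; the truncation limits, step size, and function names are hypothetical choices. A numerical experiment of this kind only suggests divergence; it does not prove it.

```python
import math

def log_p(x):
    """Log of the standard normal density (assumed form of p)."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_q(x):
    """Log of the standard Cauchy density (assumed form of q)."""
    return -math.log(math.pi * (1.0 + x * x))

def truncated_divergence(log_a, log_b, limit, step=0.001):
    """Midpoint-rule estimate of the integral of a log(a/b) over [-limit, limit]."""
    n = int(round(2.0 * limit / step))
    total = 0.0
    for i in range(n):
        x = -limit + (i + 0.5) * step
        la, lb = log_a(x), log_b(x)
        total += math.exp(la) * (la - lb)
    return total * step

p_to_q_20 = truncated_divergence(log_p, log_q, 20.0)
p_to_q_40 = truncated_divergence(log_p, log_q, 40.0)
q_to_p_10 = truncated_divergence(log_q, log_p, 10.0)
q_to_p_100 = truncated_divergence(log_q, log_p, 100.0)
```

Under these assumptions the estimate of ∫ p log (p/q) dx stops changing as the limit grows, while the estimate of ∫ q log (q/p) dx keeps growing roughly in proportion to the limit, because the Cauchy density has no finite second moment.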


Granted, the integrals may diverge, but at worst they evaluate
unambiguously to +∞; the indeterminate form ∞ - ∞ does not occur. Namely,
the negative part of the integral (the integral over the set where the
integrand is negative) is finite. Consider ∫ p(x) log (p(x)/q(x)) dx, for
example. We have p(x) log (p(x)/q(x)) ≥ (-1/e) q(x), since the minimum of
t log t for t > 0 is the value -1/e at the point t = 1/e. Therefore, the
negative part of the integral is not less than -1/e.
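The bound on the negative part rests on the minimum of t log t. The sketch below locates that minimum on a grid and checks the bound on one discrete example; the grid, the example distributions, and the function names are illustrative assumptions.

```python
import math

def t_log_t(t):
    return t * math.log(t)

# Grid search over t in (0, 3]; the grid spacing 1e-4 is an arbitrary choice.
grid = [i / 10000.0 for i in range(1, 30001)]
t_min = min(grid, key=t_log_t)
value_min = t_log_t(t_min)

def negative_part(p, q):
    """Sum of the negative terms of p_i log(p_i / q_i) for discrete p, q."""
    return sum(min(0.0, pi * math.log(pi / qi)) for pi, qi in zip(p, q))

# Hypothetical distributions chosen so that two terms are negative.
p = [0.05, 0.05, 0.9]
q = [0.45, 0.45, 0.1]
neg = negative_part(p, q)
```

Each negative term equals q_i times (t log t) with t = p_i/q_i, so it is at least -q_i/e, and the sum over any set of terms is at least -1/e.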

The axioms thus remain meaningful for infinite values of F. For instance,
the equation F(p',q') = F(p,q) in the invariance axiom implies that if either
of the quantities F(p',q') and F(p,q) is infinite, then so is the other. With
f defined as in (6), one can check that the axioms not merely remain
meaningful, they are actually satisfied for all positive probability
densities p and q.
We now come to the question of probability densities p and q that may take
the value 0. Is it possible to define f(0,0), f(u,0), and f(0,v) for positive
real numbers u and v so that the axioms still hold? The answer is no if we
continue to insist that f be a real-valued function. It can be shown that if
C > 0 in (6), then the axioms are inconsistent with having f(u,0) < ∞ for
u > 0. Likewise, if B > 0 in (6), then the axioms rule out f(0,v) < ∞ for
v > 0. If we permit f to take an infinite value when one or both of its
arguments are zero, we may adopt the convention

u log (u/0) = ∞

when u > 0. This convention is natural, since u log (u/t) tends to ∞ as t
approaches 0. Similarly, the conventions

0 log (0/u) = 0

0 log (0/0) = 0

are the natural ones.
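These conventions translate directly into code. The sketch below is one possible way to extend f of equation (6) to zero arguments; the function names are hypothetical, and the decision to skip a term whose coefficient is zero (so that the product 0 times ∞ never arises) is an assumption of this example rather than something stated in the text.

```python
import math

def u_log_ratio(u, v):
    """u log(u/v) with the conventions 0 log(0/v) = 0, 0 log(0/0) = 0, u log(u/0) = inf."""
    if u == 0.0:
        return 0.0
    if v == 0.0:
        return math.inf
    return u * math.log(u / v)

def f(u, v, b, c):
    """b * v log(v/u) + c * u log(u/v) for nonnegative u, v and b, c >= 0."""
    total = 0.0
    # A term with zero coefficient is skipped so that 0 * inf is never formed.
    if b != 0.0:
        total += b * u_log_ratio(v, u)
    if c != 0.0:
        total += c * u_log_ratio(u, v)
    return total

examples = {
    "f(0.5, 0; B=1, C=1)": f(0.5, 0.0, 1.0, 1.0),
    "f(0, 0.5; B=1, C=1)": f(0.0, 0.5, 1.0, 1.0),
    "f(0, 0; B=1, C=1)": f(0.0, 0.0, 1.0, 1.0),
    "f(0.5, 0; B=1, C=0)": f(0.5, 0.0, 1.0, 0.0),
}
```

The last example stays finite because only the B term is present and it carries the factor v = 0, in line with the remark that f(u,0) must be infinite only when C > 0.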